Author

Arthur Andrews

Goals

  • Study the aging of MLB player performance
  • Reveal the bias of estimating aging curves with simple aggregation
  • Demonstrate a statistics model for aging using splines and mixed effects modeling
  • Show some players whose aging curves are unusual or exceptional

Purpose

This is a hobby project.

Tools

  • R - programming language
  • baseballr - an R interface to MLB data
  • quarto - publishing data-rich documents from code
  • GitHub Pages - hosting code and web pages

Code

By default, the R code is displayed. If you wish, you may hide it from the “Code” menu in the top-right.

Dependencies

Code
library(tidyverse)
library(broom.mixed)
library(janitor)
library(splines)
library(factoextra)
library(reactable)
library(ggsci)
library(magrittr, include.only = "set_rownames")
library(lme4, exclude = c("expand", "pack", "unpack"))
library(htmltools, include.only = "css")
library(pracma, include.only = c("fzero", "fderiv"))

theme_set(theme_bw(base_size = 12))
Code
load("data/hitting_stats.RData")

Declare functions

For predicting with a spline function and predicting with the stats model.

Code
predict_splines <- function(x, model) {
  x |> 
    predict(object = model) |> 
    as_tibble() |> 
    mutate(across(everything(), as.numeric)) |> 
    rename_with(.fn = \(x) str_c("spline", x))    
}


add_splines <- function(hit, model) {
  hit |> 
    select(-contains("spline")) |> 
    bind_cols(
      hit$centered_age |> 
        predict_splines(model)
    )
}


add_prediction <- function(hit, model, ...) {
  hit$pred_ops <- predict(model, hit, ...)
  hit
}


tabulate <- partial(
  reactable,
  fullWidth = FALSE, highlight = TRUE, style = css(fontSize = "80%")
)

Data set

I accessed MLB hitter data from 2010-2023 using the excellent baseballr package, which offers convenient interfaces to the Statscast APIs to Baseball Reference data.

I include - seasons with at least 200 plate appearances - and hitters with at least 5 seasons

The data loaded by this notebook was first prepared in a separate script that you may find in the code repository: query_mlb.R.

Code
hit <- hit |> 
  filter(n_seasons >= 5) 
Code
hit |> 
  group_by(season) |> 
  summarize(rows = n(), hitters = n_distinct(id)) |> 
  tabulate(defaultPageSize = 15)

Print a glimpse of the data to see the schema and number of rows and columns.

Code
glimpse(hit)
Rows: 4,285
Columns: 13
$ id           <int> 110029, 110029, 110029, 110029, 110029, 110029, 110029, 1…
$ name         <chr> "Bobby Abreu", "Bobby Abreu", "Bobby Abreu", "Bobby Abreu…
$ season       <chr> "2005", "2006", "2007", "2008", "2009", "2010", "2011", "…
$ team_name    <chr> "Philadelphia Phillies", "New York Yankees", "New York Ya…
$ ab           <int> 588, 548, 605, 609, 563, 573, 502, 219, 575, 543, 417, 55…
$ pa           <int> 719, 686, 699, 684, 667, 667, 585, 257, 603, 588, 450, 59…
$ age          <dbl> 31.30732, 32.30664, 33.30595, 34.30801, 35.30732, 36.3066…
$ avg          <dbl> 0.286, 0.297, 0.283, 0.296, 0.293, 0.255, 0.253, 0.242, 0…
$ obp          <dbl> 0.405, 0.424, 0.369, 0.371, 0.390, 0.352, 0.353, 0.350, 0…
$ slg          <dbl> 0.474, 0.462, 0.445, 0.471, 0.435, 0.435, 0.365, 0.342, 0…
$ ops          <dbl> 0.879, 0.886, 0.814, 0.842, 0.825, 0.787, 0.718, 0.692, 0…
$ n_seasons    <int> 8, 8, 8, 8, 8, 8, 8, 8, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, …
$ centered_age <dbl> 2.158141, 3.157457, 4.156772, 5.158826, 6.158141, 7.15745…

Splines

I will use three B-splines to provide a flexible basis for describing the nonlinear aging curve.

Each of these splines is a smooth function of player age, and by including the split fits as predictor variables in a model, the model will be able to describe a nonlinear relationship between age and OPS. The coefficients will define the shape of the relationship.

With the splines package, the spline definition can very conveniently go inside the model formula.

Model

This model describes the player OPS as a function of age (via the splines) and player. The player name is a random effect in the model.

Random effects in statistics are a good way to include categorical variables with many levels as predictors.

Code
model1 <- lmer(
  ops ~ (1 | name) + bs(centered_age, df = 3),
  data = hit
)

model1
Linear mixed model fit by REML ['lmerMod']
Formula: ops ~ (1 | name) + bs(centered_age, df = 3)
   Data: hit
REML criterion at convergence: -8955.141
Random effects:
 Groups   Name        Std.Dev.
 name     (Intercept) 0.06695 
 Residual             0.07500 
Number of obs: 4285, groups:  name, 543
Fixed Effects:
              (Intercept)  bs(centered_age, df = 3)1  
                 0.720701                   0.147092  
bs(centered_age, df = 3)2  bs(centered_age, df = 3)3  
                -0.002604                  -0.153229  

Visualize the spline functions

Code
bind_cols(
  hit |> 
    select(age),
  hit$centered_age |> 
    bs(3)
) |> 
  pivot_longer(-age, names_to = "spline") |> 
  ggplot() + aes(age, value, color = spline) + geom_line()

Code
pal <- pal_d3()(10)

mean_player <- hit |> 
  add_prediction(model1, re.form = ~0)

mean_player |> 
  ggplot() + aes(age, pred_ops) + geom_line(color = pal[2]) + 
  labs(title = "modeled aging curve", y = "ops") +
  scale_y_continuous(labels = scales::label_number(accuracy = 0.001))

Model coefficients

Code
model1 |> 
  tidy("fixed") |> 
  mutate(across(where(is.numeric), \(x) round(x, 3))) |> 
  tabulate()

The random effect intercept is the +/- of each player’s aging curve relative to the mean player. Unsurprisingly, tabulating these players by decreasing coefficients lists the greatest hitters of the era.

Code
model1 |> 
  tidy("ran_vals") |> 
  mutate(across(where(is.numeric), \(x) round(x, 3))) |> 
  arrange(desc(estimate)) |> 
  tabulate(columns = list(level = colDef(minWidth = 150)))

Calculate the age of peak OPS by numerically solving for the zero of the 1st derivative with respect to age.

Code
predict_ops <- function(age, model) {
  tibble(centered_age = age - mean(hit$age)) |> 
    predict(object = model, re.form = ~0)
}

calc_ops_deriv <- function(age, model) {
  partial(predict_ops, model = model) |> # now a function of only age
    fderiv(age)
}

calc_zero_deriv <- function(model) {
  partial(calc_ops_deriv, model = model) |> # now a function of only age
    fzero(x = 30) |> 
    pluck("x")
}

model1 |> 
  calc_zero_deriv() |> 
  cat()
26.81941

Aggregation

Instead of performing the more complex steps of engineering the splines and fitting the statistical model, it might be tempting to simply average OPS by player age to see the aging curve.

This requires no statistics, but we’ll quickly find a problem.

Code
grouped_hit <- hit |> 
  group_by(age = round(age)) |> 
  mutate(player_contribution = pa / sum(pa)) 

aggregate_player <- grouped_hit |> 
  summarize(
    across(c(avg, obp, slg, ops), \(x) weighted.mean(x, pa)),
    .groups = "drop"
  ) 
Code
aggregate_player |> 
  ggplot() + aes(age, ops) + geom_line(color = pal[1]) +
  labs(title = "ops aggregated by player age")

This result looks substantially different from the fit aging curve.

That’s because it’s subject to an omitted variable bias. The players at the low age ranges (19, 20) and high (37, 38, …) are all-star and hall-of-fame caliber players. They distort the averages in these age ranges.

Code
tabulate_one_age <- function(ages) {
  grouped_hit |> 
    filter(age %in% ages) |> 
    slice_max(player_contribution, n = 5) |> 
    select(age, name, player_contribution) |> 
    tabulate(
      columns = list(
        name = colDef(minWidth = 150),
        player_contribution = colDef(
          "player contribution", 
          format = colFormat(percent = TRUE, digits = 1)
        )
      )
    )  
}
tabulate_one_age(20)
tabulate_one_age(21)
Code
Code
tabulate_one_age(38)
tabulate_one_age(39)
Code
Code

Model/aggregate comparison

Code
aging <- aggregate_player |> 
  mutate(
    centered_age = age - mean(hit$age), 
    pred_ops = tibble(centered_age) |> 
      predict(object = model1, re.form = ~0)
  ) 

aging |> 
  select(age, modeled = pred_ops, aggregate = ops) |> 
  pivot_longer(-age) |> 
  mutate(name = factor(name, c("aggregate", "modeled"))) |> 
  ggplot() + aes(age, value, color = name) + geom_line() + scale_color_d3() +
  labs(y = "ops", title = "modeled and aggregate ops aging")

A more complex model

By including a random slope for centered_age, we can allow different players to have different aging curves.

The partial pooling of mixed effects models is really nice for this, because it will penalize really extreme values of this slope. To me, this is a desirable feature - to keep the slopes realistic, even for players with a smaller amount of data.

Typically, including a random slope requires also adding a corresponding fixed effect. In this case, however, I will rely on the spline fits to fill the role of the fixed effect. This will maintain the population mean centered_age slope at zero.

Each player will be given a centered_age coefficient which modifies their aging curve.

Code
model2 <- lmer(
  ops ~ (centered_age | name) + bs(centered_age, df = 3),
  data = hit
) 

model2
Linear mixed model fit by REML ['lmerMod']
Formula: ops ~ (centered_age | name) + bs(centered_age, df = 3)
   Data: hit
REML criterion at convergence: -9016.766
Random effects:
 Groups   Name         Std.Dev. Corr
 name     (Intercept)  0.066054     
          centered_age 0.005782 0.07
 Residual              0.072703     
Number of obs: 4285, groups:  name, 543
Fixed Effects:
              (Intercept)  bs(centered_age, df = 3)1  
                  0.71586                    0.15025  
bs(centered_age, df = 3)2  bs(centered_age, df = 3)3  
                  0.01849                   -0.19515  

Looking at the players with the most extreme values of the centered_age coefficient is interest: it reveals the players whose performance “aged” the fastest (negative values) and slowest (positive values).

Code
cf <- model2 |> 
  tidy("ran_vals") |> 
  filter(term == "centered_age") |> 
  mutate(across(where(is.numeric), \(x) round(x, 4))) |> 
  arrange(desc(estimate)) 

cf |> 
  tabulate(columns = list(level = colDef(minWidth = 150)))

In some cases, players aging faster than average is due to historically high peaks.

Code
predict_player <- function(players, model) {
  
  mean_player <- hit |> 
    distinct(age, centered_age) |> 
    mutate(
      mean_player = tibble(centered_age) |> 
        predict(object = model, re.form = ~0)
    )
  
  hit |> 
    filter(name %in% players) |> 
    left_join(cf, by = join_by(name == level)) |> 
    mutate(
      player = tibble(name, age, centered_age) |> 
        predict(object = model),
      name = name |> 
        reorder(-abs(estimate))
    ) |> 
    ggplot() + 
    aes(x = age, color = "player") + geom_point(aes(y = ops)) + geom_line(aes(y = player)) +
    geom_line(aes(x = age, y = mean_player, color = "mean_player"), data = mean_player) +
    scale_color_manual(values = c("player" = "grey30", "mean_player" = pal[2])) +
    facet_wrap(~name + estimate)
}

predict_player("Miguel Cabrera", model2)

Players aging gracefully

Here are the aging curves for the 16 players with the slowing aging.

Code
cf |> 
  slice_max(estimate, n = 16, with_ties = FALSE) |> 
  pull(level) |> 
  predict_player(model2)

Players aging rapidly

Here are the aging curves for the 16 players with the most rapid aging.

In many cases, these were players with All-Star 20’s, but average or below-average 30’s.

Code
cf |> 
  slice_min(estimate, n = 16, with_ties = FALSE) |> 
  pull(level) |> 
  predict_player(model2)

Conclusions

Splines and mixed effects models are a great choice for studying the aging of MLB player performance.

The splines can describe a nonlinear aging curve, and mixed effects models are ideal for working with categorical variables (in this case, player) with many levels.